Moving from serial CPU programming to GPU programming requires a paradigm shift: from iterating element by element to block-based execution. Instead of treating data as a stream of scalars, we view it as a collection of blocks that are scheduled to make full use of the hardware's bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's performance bottleneck depends on the ratio of math operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends far more time waiting on data from DRAM than actually computing.
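The ratio above can be made concrete as "arithmetic intensity" (FLOPs per byte of memory traffic). A minimal sketch, assuming float32 elements and counting only the two loads and one store per element:

```python
# Sketch: estimating the arithmetic intensity of float32 vector addition.
# The element count and byte accounting are illustrative assumptions.

def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# out = x + y on N float32 elements:
#   1 add per element; 3 memory ops per element (load x, load y, store out),
#   each moving 4 bytes.
N = 1_000_000
flops = N                 # one addition per element
bytes_moved = 3 * 4 * N   # two loads + one store, 4 bytes each

ai = arithmetic_intensity(flops, bytes_moved)
print(f"arithmetic intensity = {ai:.4f} FLOP/byte")  # ~0.083
```

At roughly 0.08 FLOP/byte, vector addition sits far below the FLOP-to-bandwidth ratio of any modern GPU, which is why the memory bus, not the ALUs, is the limiter.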
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we fail to fill the GPU's wide execution lanes. A well-chosen size ensures there is enough work "in flight" to saturate the memory bus.
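One concrete consequence of BLOCK_SIZE is the size of the launch grid: the number of blocks is the ceiling of the element count divided by BLOCK_SIZE. A minimal sketch (function name and numbers are illustrative):

```python
# Sketch: how BLOCK_SIZE determines the launch grid via ceiling division.

def grid_size(n_elements: int, block_size: int) -> int:
    """Number of blocks needed so that every element is covered."""
    return (n_elements + block_size - 1) // block_size

# A tiny BLOCK_SIZE launches many small blocks that cannot fill wide
# SIMD lanes; a larger size keeps each block's lanes busy.
n = 10_000
print(grid_size(n, 32))    # 313 blocks of 32 elements
print(grid_size(n, 1024))  # 10 blocks of 1024 elements
```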
3. Hiding Latency via Occupancy
Occupancy refers to the number of active blocks resident on the GPU. It is not a goal in itself, but it allows the scheduler to switch to another block and keep computing while one block waits on a high-latency fetch from device memory.
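A toy back-of-envelope model makes this concrete: if a memory access stalls a block for some number of cycles, the scheduler needs enough other resident blocks whose compute work covers that wait. All numbers below are illustrative assumptions, not figures for any real GPU:

```python
# Sketch: a toy latency-hiding model (all cycle counts are illustrative).
# While one block waits `mem_latency` cycles for data, the scheduler issues
# compute from other resident blocks; the stall is hidden once their
# combined compute covers the wait.
import math

def blocks_to_hide_latency(mem_latency: int, compute_cycles: int) -> int:
    """Resident blocks needed so a memory stall is fully overlapped
    (the +1 accounts for the block that is itself waiting)."""
    return math.ceil(mem_latency / compute_cycles) + 1

# Assume a 400-cycle memory access and 20 cycles of math per block
# between accesses:
print(blocks_to_hide_latency(400, 20))  # 21 resident blocks
```

This is why low occupancy hurts memory-bound kernels: with too few blocks in flight, the scheduler simply runs out of work to issue during stalls.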
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory-coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
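The coalescing rule can be illustrated by comparing the addresses a block's threads touch under a contiguous versus a strided indexing scheme. A minimal sketch (the block id, sizes, and stride are hypothetical):

```python
# Sketch: coalesced vs. strided access patterns (illustrative values).
# Consecutive threads should touch consecutive addresses so the hardware
# can merge their accesses into a few wide memory transactions.

BLOCK_SIZE = 8
pid = 2  # hypothetical block id

# Coalesced: thread i reads element block_start + i (contiguous addresses).
block_start = pid * BLOCK_SIZE
coalesced = [block_start + i for i in range(BLOCK_SIZE)]
print(coalesced)  # [16, 17, 18, 19, 20, 21, 22, 23]

# Strided: thread i reads element pid + i * num_blocks (addresses spread
# out), forcing many separate, narrow transactions.
num_blocks = 4
strided = [pid + i * num_blocks for i in range(BLOCK_SIZE)]
print(strided)  # [2, 6, 10, 14, 18, 22, 26, 30]
```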
QUESTION 1
For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
Shared Memory Latency
✅ Correct!
Vector addition involves very little math compared to the amount of data moved (3 memory ops per 1 add), making it memory-bound.
❌ Incorrect
Arithmetic throughput is rarely the bottleneck for simple element-wise operations like addition.
QUESTION 2
What is the primary purpose of 'Occupancy' in the GPU execution model?
To ensure every thread runs as fast as possible.
To hide memory latency by keeping work in flight.
To increase the clock speed of the compute units.
To reduce the power consumption of the HBM.
✅ Correct!
High occupancy allows the GPU to switch to active threads while others wait for data from global memory.
❌ Incorrect
Occupancy doesn't change thread speed or clock frequency; it focuses on scheduler efficiency.
QUESTION 3
Which of the following describes 'Memory-Bound' behavior?
The GPU is waiting for the memory bus to deliver data.
The GPU has exhausted its available VRAM.
The kernel is performing too many complex floating-point operations.
The CPU cannot launch kernels fast enough.
✅ Correct!
Memory-bound kernels are limited by the speed of data transfer from DRAM/HBM to the registers.
❌ Incorrect
Exhausting VRAM is an Out-of-Memory error, not a 'memory-bound' performance bottleneck.
QUESTION 4
What happens if the BLOCK_SIZE is set too small?
The kernel will fail with a memory error.
The GPU fails to utilize its wide SIMD execution lanes.
The memory bandwidth increases significantly.
Register pressure becomes too high.
✅ Correct!
Small block sizes result in underutilization because the hardware's execution units expect many threads to work in parallel.
❌ Incorrect
Small block sizes actually reduce register pressure but hurt throughput.
QUESTION 5
In the logistics warehouse analogy, what represents the 'Blocks'?
The individual items.
The workers.
The organized pallets.
The delivery trucks.
✅ Correct!
Organizing items into pallets (Blocks) ensures efficient transport and processing by workers (Compute Units).
❌ Incorrect
The trucks represent the memory bus; the workers represent the compute units.
Case Study: Bottleneck Analysis
Identifying Kernel Constraints
You are profiling four kernels: a Vector Addition kernel, a Deep Matrix Multiplication (GEMM) kernel, a tiny 4-element Vector Addition kernel, and a kernel that performs ReLU on a matrix. You need to categorize their bottlenecks based on hardware utilization theory.
Q
1. For each kernel (Vector Add, Matrix Multiply, 4-element Vector Add), decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead.
Solution:
1. **Vector Addition**: Memory Bandwidth (low math-to-memory ratio).
2. **Deep Matrix Multiply**: Arithmetic Throughput (high $O(N^3)$ compute vs $O(N^2)$ memory).
3. **4-element Vector Add**: Launch Overhead (the time to start the GPU kernel outweighs the tiny workload).
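The vector-add vs. GEMM split can be checked with rough FLOP-to-byte ratios. The counts below are the standard back-of-envelope estimates for float32 data, not profiler measurements:

```python
# Sketch: rough arithmetic-intensity estimates behind the classification
# above (float32; counts are back-of-envelope, not measured).

def intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

N = 4096

# Vector add: 1 FLOP and 12 bytes (two loads + one store) per element.
vec_add = intensity(N, 12 * N)

# GEMM (N x N): ~2*N^3 FLOPs; at minimum three N^2 float32 matrices moved.
gemm = intensity(2 * N**3, 3 * 4 * N**2)

print(f"vector add: {vec_add:.3f} FLOP/byte")  # ~0.083: well below machine balance
print(f"GEMM:       {gemm:.1f} FLOP/byte")     # hundreds: enough to be compute-bound
```

Because GEMM's intensity grows linearly with $N$ while vector add's stays constant, large matrix multiplies cross into compute-bound territory and tiny element-wise kernels never do.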
Q
2. Determine the bottleneck for a ReLU operation on a large matrix.
Solution:
The bottleneck for **ReLU** on a matrix is **Memory Bandwidth**. Since the operation is a simple comparison ($max(0, x)$), it is extremely computationally cheap, meaning performance is dictated by how fast the GPU can read the matrix from and write it back to global memory.